scRNA-seq analysis of Human Substantia Nigra

Daniele Colombo

2021-07-30

Introduction

In this report, a scRNA-Seq analysis is described. The sample used in the analysis comes from human substantia nigra, and it’s one of the datasets studied in “Single-cell multi-omic integration compares and contrasts features of brain cell identity” (Welch et al. 2019) to evaluate the LIGER algorithm for modeling single-cell datasets. The dataset is stored in PanglaoDB (SRA850958) and was obtained through Illumina NovaSeq 6000 and the 10X chromium protocol. The main steps for the scRNA-Seq analysis follows the Seurat vignette containing the clustering tutorial.

Used packages

The main packages used for this analysis are from the following attached packages:

library(Seurat)
library(dplyr)
library(patchwork)
library(tidyverse)
library(umap)

Data loading and preprocessing

After downloading the data from PanglaoDB, it is imported in R and only the gene names are kept as row names:

data <- get(load("SRA850958_SRS4386111.sparse.RData"))
rownames(data) <- gsub("_ENS.*", "", rownames(data))

The loaded data is converted into a SeuratObject in order to then proceed with the analysis steps:

sn <- CreateSeuratObject(counts = data, project = "s. nigra", min.cells = 3, min.features = 200)
sn
An object of class Seurat 
27297 features across 6909 samples within 1 assay 
Active assay: RNA (27297 features, 0 variable features)

Quality control and filtering

The main parameters that are checked for quality control are:

  • Percentage of mitochondrial reads: cells with a too high percentage of mt reads need to be removed, as this could be an index of damaged cells.
  • Unique gene counts (number of features): cells with either too little or too much genes expressed need to be removed, as they may be low quality cells or doublets respectively.
  • Read counts: similar to gene counts.

While the number of genes and reads is automatically computed in the SeuratObject, the percentage of mitochondrial reads needs to be computed and added as metadata:

sn[["percent.mt"]] <- PercentageFeatureSet(sn, pattern = "^MT-")  # compute percentage of mt genes
sn@meta.data

In order to evaluate the distribution of the aforementioned variables, the following plots are produced:

VlnPlot(sn, features = "nFeature_RNA", pt.size = 0)
VlnPlot(sn, features = "nCount_RNA", pt.size = 0)
VlnPlot(sn, features = "percent.mt", pt.size = 0, y.max = 15)

By observing the data, some thresholds are applied to remove cells that have:

  • A number of unique genes expressed lower than 200 or higher than 3500;
  • A percentage of mitochondrial genes higher than 8.
FeatureScatter(sn, feature1 = "nCount_RNA", feature2 = "percent.mt") + geom_hline(yintercept = 8,
    colour = "red")
FeatureScatter(sn, feature1 = "nCount_RNA", feature2 = "nFeature_RNA") + geom_hline(yintercept = 3500,
    colour = "red") + geom_hline(yintercept = 200, colour = "red")

sn <- subset(sn, subset = nFeature_RNA > 200 & nFeature_RNA < 3500 & percent.mt <
    8)
sn
An object of class Seurat 
27297 features across 6644 samples within 1 assay 
Active assay: RNA (27297 features, 0 variable features)

6644 cells are left after filtering.

Normalization

After filtering cells the data is normalized with a global scaling normalization. The default normalization of Seurat (a log-normalization) normalizes the feature expression measurements for each cell by the total expression, multiplies it by a scale factor (10,000 by default), and log-transforms the result.

sn <- NormalizeData(sn)

Feature selection

For downstream analysis, only a subset of the most variable (and therefore most informative) genes is selected. The top 2000 variable features are selected:

sn <- FindVariableFeatures(sn, selection.method = "vst", nfeatures = 2000)
top10 <- head(VariableFeatures(sn), 10)
paste("The top 10 variable genes are:", paste(top10, collapse = ", "))
[1] "The top 10 variable genes are: SYT1, SYNPR, TAC1, PCP4, RGS5, NDUFA4L2, CHI3L1, SNAP25, HBB, MEG3"

The selected features are shown in red in the following plot, and the top 10 variable ones are labeled with their corresponding gene symbol:

plot <- VariableFeaturePlot(sn)
LabelPoints(plot = plot, points = top10, repel = TRUE, xnudge = 0, ynudge = 0)

Data scaling

The counts are then scaled in order to have:

  • Mean expression of each gene equal to 0
  • Variance of the expression of each gene equal to 1

In this way, all the data will have a tendency to a ternary signature: for each cell, a gene can be considered as up-regulated (counts > 0), down-regulated (counts < 0) or with average expression (counts = 0). This scaling is necessary for the following steps of dimensionality reduction and clustering.

sn <- ScaleData(sn, features = rownames(sn))

In the scaling step, it can be possible to regress out some variables from the result, in order to remove some biases that may be caused by them. This step is usually used to remove unwanted variability between cells that is caused by the cell cycle effect. In order to evaluate if this step is necessary:

  • Each cell is assigned to its predicted cell cycle phase based on the expression of phase specific genes;
  • The cells are then plotted in a lower dimensionality space to evaluate if they cluster based on the predicted cell cycle.
# Cell phase prediction
s.genes <- cc.genes$s.genes
g2m.genes <- cc.genes$g2m.genes
sn <- CellCycleScoring(sn, s.features = s.genes, g2m.features = g2m.genes, set.ident = TRUE)

# Linear dimensionality reduction with PCA
sn <- RunPCA(sn, features = VariableFeatures(object = sn))

# Visualization in a lower dimensionality space
DimPlot(sn, reduction = "pca")

From the previous plot, we can observe that there seems to be no clear separation between cells based on cell cycle phase. In this case, it’s not necessary to regress out cell cycle.

Dimensionality determination

In order to perform an effective clustering to classify cells, an appropriate number of principal components needs to be chosen. The choice is made so that the number of PC used is the minimum one that allows to describe the data variability as completely as possible. In order to evaluate which could be the best number of PC, the elbow plot is used, which plots the standard deviation explained by each component:

ElbowPlot(sn, ndims = 50) + geom_vline(colour = "red", xintercept = 18)

As represented by the red line, the optimal number of PC seems to be 18, as after that number there is no significant variation in the explained variability (so most probably the following principal components don’t add much information).

Even if 18 seems to be the best choice, different numbers of principal components were also used for the following clustering steps. In the end, 18 seemed to give the best results.

UMAP Clustering

Seurat uses a kNN graph-based clustering using as metric the euclidean distance in the chosen PC space. To cluster the cells, two steps are applied:

  • First, the kNN graph is built with the function FindNeighbors(), to which the number of PC is passed.
  • Then, cells are divided into clusters through modularity optimization techniques (such as the Louvain algorithm) with the function FindClusters(), to which the resolution parameter is passed (indicating the granularity wanted for the clusters).
sn <- FindNeighbors(sn, dims = 1:18)
sn <- FindClusters(sn, resolution = 0.5)
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 6644
Number of edges: 271593

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9206
Number of communities: 10
Elapsed time: 1 seconds

PCA visualization is not the best method for showing cluster of cells based on scRNA data. UMAP is usually used instead, which is a non-linear dimensionality reduction method. In order to visualize the data with UMAP dimensionality reduction, the function RunUMAP() from the package umap is used. The resulting clustering can then be plotted selecting the appropriate reduction method:

sn <- RunUMAP(sn, dims = 1:18)
DimPlot(sn, reduction = "umap")

Cluster markers

Through differential expression analysis, it’s possible to identify the genes that best define each cluster. The function FindAllMarkers() allows to find the DE genes for all the clusters,

sn <- BuildClusterTree(sn)
sn.markers <- FindAllMarkers(sn, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
cluster_genes <- sn.markers %>%
    group_by(cluster) %>%
    top_n(n = 5, wt = avg_log2FC)
cluster_genes

For each cluster, it is possible to visualize the specificity of a marker gene to the cluster in which it’s differentially expressed. In order to do that, two types of plots are useful:

  • Violinplots showing the distribution of gene expression for each cell in each cluster.
  • An heatmap showing in the UMAP plot directly how much a gene is expressed in each cluster.
VlnPlot(sn, features = c("IQCA1", "STMN4", "CERCAM", "KLK6", "OPALIN", "CSF3R", "VCAN",
    "SYT1", "FLT1", "HIGD1B"), pt.size = 0, ncol = 2)

FeaturePlot(sn, features = c("IQCA1", "STMN4", "CERCAM", "KLK6", "OPALIN", "CD74",
    "VCAN", "SYT1", "FLT1", "HIGD1B"), ncol = 2)

Cell type assignment

By manually searching for the marker genes expression in PanglaoDB search and in the literature, a cell type is assigned to each of the clusters. The result of this analysis can be seen in the UMAP plot by changing the cluster labels:

new.cluster.ids <- c("Astrocytes", rep("Oligodendrocytes", 4), "Microglia", "OPC",
    "Neurons", "Endothelial cells", "Pericytes")
names(new.cluster.ids) <- levels(sn)
sn <- RenameIdents(sn, new.cluster.ids)
DimPlot(sn, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend()

# Show the number of cells inside each cluster
summary(sn[["seurat_clusters"]], maxsum = 10)
 seurat_clusters
 0:1384         
 1: 976         
 2: 863         
 3: 754         
 4: 728         
 5: 711         
 6: 589         
 7: 270         
 8: 228         
 9: 141         

With this last plot, it’s possible to see that the different population are well clustered, showing a similar clustering to the ones reported in PanglaoDB and in the reference article.

Session info

R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Pop!_OS 20.10

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/atlas/libblas.so.3.10.3
LAPACK: /usr/lib/x86_64-linux-gnu/atlas/liblapack.so.3.10.3

locale:
 [1] LC_CTYPE=it_IT.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=it_IT.UTF-8        LC_COLLATE=it_IT.UTF-8    
 [5] LC_MONETARY=it_IT.UTF-8    LC_MESSAGES=it_IT.UTF-8   
 [7] LC_PAPER=it_IT.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=it_IT.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] umap_0.2.7.0       forcats_0.5.1      stringr_1.4.0      purrr_0.3.4       
 [5] readr_1.4.0        tidyr_1.1.3        tibble_3.1.2       ggplot2_3.3.5     
 [9] tidyverse_1.3.1    patchwork_1.1.1    dplyr_1.0.7        SeuratObject_4.0.2
[13] Seurat_4.0.3       rmdformats_1.0.2   knitr_1.33        

loaded via a namespace (and not attached):
  [1] readxl_1.3.1          backports_1.2.1       plyr_1.8.6           
  [4] igraph_1.2.6          lazyeval_0.2.2        splines_4.0.2        
  [7] listenv_0.8.0         scattermore_0.7       digest_0.6.27        
 [10] htmltools_0.5.1.1     fansi_0.5.0           magrittr_2.0.1       
 [13] tensor_1.5            cluster_2.1.0         ROCR_1.0-11          
 [16] globals_0.14.0        modelr_0.1.8          matrixStats_0.59.0   
 [19] askpass_1.1           spatstat.sparse_2.0-0 colorspace_2.0-2     
 [22] rvest_1.0.0           ggrepel_0.9.1         haven_2.4.1          
 [25] xfun_0.24             crayon_1.4.1          jsonlite_1.7.2       
 [28] spatstat.data_2.1-0   survival_3.2-11       zoo_1.8-9            
 [31] glue_1.4.2            polyclip_1.10-0       gtable_0.3.0         
 [34] leiden_0.3.8          future.apply_1.7.0    abind_1.4-5          
 [37] scales_1.1.1          DBI_1.1.1             miniUI_0.1.1.1       
 [40] Rcpp_1.0.7            viridisLite_0.4.0     xtable_1.8-4         
 [43] reticulate_1.20       spatstat.core_2.2-0   htmlwidgets_1.5.3    
 [46] httr_1.4.2            RColorBrewer_1.1-2    ellipsis_0.3.2       
 [49] ica_1.0-2             farver_2.1.0          pkgconfig_2.0.3      
 [52] sass_0.4.0            uwot_0.1.10           dbplyr_2.1.1         
 [55] deldir_0.2-10         utf8_1.2.1            labeling_0.4.2       
 [58] tidyselect_1.1.1      rlang_0.4.11          reshape2_1.4.4       
 [61] later_1.2.0           munsell_0.5.0         cellranger_1.1.0     
 [64] tools_4.0.2           cli_3.0.0             generics_0.1.0       
 [67] broom_0.7.8           ggridges_0.5.3        evaluate_0.14        
 [70] fastmap_1.1.0         yaml_2.2.1            goftest_1.2-2        
 [73] fs_1.5.0              fitdistrplus_1.1-5    RANN_2.6.1           
 [76] pbapply_1.4-3         future_1.21.0         nlme_3.1-149         
 [79] mime_0.11             formatR_1.11          xml2_1.3.2           
 [82] compiler_4.0.2        rstudioapi_0.13       plotly_4.9.4.1       
 [85] png_0.1-7             spatstat.utils_2.2-0  reprex_2.0.0         
 [88] bslib_0.2.5.1         stringi_1.6.2         highr_0.9            
 [91] RSpectra_0.16-0       lattice_0.20-41       Matrix_1.3-4         
 [94] vctrs_0.3.8           pillar_1.6.1          lifecycle_1.0.0      
 [97] spatstat.geom_2.2-0   lmtest_0.9-38         jquerylib_0.1.4      
[100] RcppAnnoy_0.0.18      data.table_1.14.0     cowplot_1.1.1        
[103] irlba_2.3.3           httpuv_1.6.1          R6_2.5.0             
[106] bookdown_0.22         promises_1.2.0.1      KernSmooth_2.23-17   
[109] gridExtra_2.3         parallelly_1.26.1     codetools_0.2-16     
[112] MASS_7.3-54           assertthat_0.2.1      openssl_1.4.4        
[115] withr_2.4.2           sctransform_0.3.2     mgcv_1.8-32          
[118] parallel_4.0.2        hms_1.1.0             grid_4.0.2           
[121] rpart_4.1-15          rmarkdown_2.9         Rtsne_0.15           
[124] shiny_1.6.0           lubridate_1.7.10